Change exponential backoff algorithm for federation send #4597
Conversation
@@ -17,7 +17,7 @@ export RUST_BACKTRACE=1

 if [ -n "$PACKAGE" ];
 then
-  cargo test -p $PACKAGE --all-features --no-fail-fast
+  cargo test -p $PACKAGE --all-features --no-fail-fast $TEST
This makes it possible to run a single test without manually editing the script each time.
Nice.
Right now, the base of the exponent is 2. This means that for any downtime of X hours, the maximum additional time until federation resumes is another X hours - at worst the downtime is roughly doubled. The expected / average value should be half the downtime I think (due to discretization). Maybe just changing the constant to a smaller base would already be enough. Here's a simulation of downtimes with different exponential delays (a small sketch that reproduces these tables follows after them):
2**retries
Instance Downtime | Federation Resumed At (Including Downtime) | Total Retry Attempts |
---|---|---|
1m 0s | 1m 3s | 6 |
10m 0s | 17m 3s | 10 |
30m 0s | 34m 7s | 11 |
1h 0m | 1h 8m | 12 |
2h 0m | 2h 16m | 13 |
3h 0m | 4h 33m | 14 |
6h 0m | 9h 6m | 15 |
12h 0m | 18h 12m | 16 |
24h 0m | 36h 24m | 17 |
48h 0m | 72h 49m | 18 |
1.5**retries
Instance Downtime | Federation Resumed At (Including Downtime) | Total Retry Attempts |
---|---|---|
1m 0s | 1m 15s | 9 |
10m 0s | 14m 34s | 15 |
30m 0s | 32m 49s | 17 |
1h 0m | 1h 13m | 19 |
2h 0m | 2h 46m | 21 |
3h 0m | 4h 9m | 22 |
6h 0m | 6h 14m | 23 |
12h 0m | 14h 1m | 25 |
24h 0m | 31h 33m | 27 |
48h 0m | 71h 1m | 29 |
1.25**retries
Instance Downtime | Federation Resumed At (Including Downtime) | Total Retry Attempts |
---|---|---|
1m 0s | 1m 9s | 13 |
10m 0s | 11m 14s | 23 |
30m 0s | 34m 24s | 28 |
1h 0m | 1h 7m | 31 |
2h 0m | 2h 11m | 34 |
3h 0m | 3h 25m | 36 |
6h 0m | 6h 41m | 39 |
12h 0m | 13h 3m | 42 |
24h 0m | 25h 30m | 45 |
48h 0m | 49h 49m | 48 |
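(For reference: the numbers in the three tables can be reproduced with a small simulation along these lines. This is a sketch of the model behind the tables, not the actual federation code - it assumes the n-th retry sleeps base^n seconds, starting at n = 0, and that federation resumes at the first retry after the instance comes back up.)

```rust
/// Sketch: simulate how long federation stays paused for a given downtime and
/// exponent base. Assumes retry n sleeps base^n seconds (n starting at 0).
fn simulate(downtime_secs: f64, base: f64) -> (f64, u32) {
    let mut elapsed = 0.0; // seconds since the first failed send
    let mut retries = 0u32;
    // Keep retrying until a retry lands after the instance is back up.
    while elapsed < downtime_secs {
        elapsed += base.powi(retries as i32); // sleep before the next attempt
        retries += 1;
    }
    (elapsed, retries) // (time federation resumes, total retry attempts)
}

fn main() {
    for &downtime in &[60.0, 600.0, 1800.0, 3600.0, 10800.0, 172800.0] {
        for &base in &[2.0, 1.5, 1.25] {
            let (resumed, retries) = simulate(downtime, base);
            println!("down {downtime:>7.0}s, base {base}: resumed after {resumed:>7.0}s ({retries} retries)");
        }
    }
}
```

For example, a 48h downtime (172800s) with base 2 resumes after 262143s ≈ 72h 49m and 18 retries, matching the first table.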
Also, I was actually thinking that it might be smarter to go in the opposite direction from this PR and remove the "dead" flag altogether, and instead just keep exponentially increasing the retry delays for all instances - maybe with a limit of a week or so.
Capping at an hour kind of works, I guess: then we basically have an exponential delay between 0s and 1h, then a fixed retry every 1h for a day, and after that a fixed retry every 24h (the dead check). But reducing the base of the exponentiation seems cleaner.
I don't think that's true - the list of instances is fully refreshed in the loop in start_stop_federation_workers, which runs every INSTANCES_RECHECK_DELAY (once per minute).
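A rough sketch of what that refresh loop does, for context (this is an illustration of the behavior described above, not the actual Lemmy code; only the names start_stop_federation_workers and INSTANCES_RECHECK_DELAY come from the discussion):

```rust
use std::time::Duration;

// Once per minute, as mentioned above.
const INSTANCES_RECHECK_DELAY: Duration = Duration::from_secs(60);

/// Illustrative outline only: the full instance list is re-read on every
/// iteration, so an instance marked dead at startup gets a send worker again
/// as soon as it is no longer considered dead, and workers for instances
/// that have since become dead are stopped.
async fn start_stop_federation_workers() {
    loop {
        // 1. reload the full list of instances from the database
        // 2. spawn a send worker for every alive instance that has none
        // 3. stop the worker of every instance that is now dead
        tokio::time::sleep(INSTANCES_RECHECK_DELAY).await;
    }
}
```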
That seems like a good idea; it means there is one less value to check and less complexity. However, I don't like the values you are showing. If an instance is down for 3 hours, it will be defederated for 6.5 hours before federation resumes. That doesn't make sense, because it's extremely cheap to send an inbox request, so we can easily do it once per hour during the entire first day. For the limit, one day should be fine; it's what we are using now as well.
True, I missed the loop there.
I think you might be misreading my tables (they are a bit confusing). The second column is the total duration including the downtime, so for the 3h downtime the additional delay until federation resumes would be 1h 33m with 2^N, 1h 9m with 1.5^N, and 25m with 1.25^N.
With 1.25^N the delay until refederation stays around or below your fixed 1h limit all the way up to an instance downtime of more than 24h.
Okay, in that case it sounds good; I've changed it to 1.25^n. I also changed it to ignore the first error, so it doesn't sleep for over a second when there's only a single failure - there need to be at least two failures before it starts sleeping. I also thought about not marking instances as dead anymore. But then we would have to keep thousands of federation workers around for dead instances which do nothing but sleep for a very long time. We could instead mark instances as dead based on
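A minimal sketch of the retry delay as described here - base 1.25, no sleep after a single failure, and an assumed cap of one day (the exact exponent offset and cap in the merged code may differ):

```rust
use std::time::Duration;

/// Sketch of the delay discussed above; not the exact Lemmy implementation.
fn federation_retry_delay(fail_count: u32) -> Duration {
    // Assumed upper bound of one day, matching the "dead" recheck interval.
    const MAX_DELAY: Duration = Duration::from_secs(60 * 60 * 24);
    if fail_count <= 1 {
        return Duration::ZERO; // ignore the first error: no sleep at all
    }
    // Base 1.25 instead of 2 keeps the extra delay small relative to downtime.
    let secs = 1.25_f64.powi(fail_count as i32);
    Duration::from_secs_f64(secs).min(MAX_DELAY)
}
```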
Lemmy.world had a problem with incoming federation where /inbox requests were failing, so FederationQueueState::fail_count kept increasing. On lemmy.ml the value reached 16, which results in a sleep of 18 hours before the next activity send would be attempted. This is so long that it was necessary to reset the count manually. That shouldn't be necessary - Lemmy shouldn't wait such a long time - so I've added a limit of one hour. Alternatively it may be possible to calculate the sleep interval differently instead of 2^n, but I didn't find any resources on exponential backoff which match our use case.

@phiresky I noticed another problem with the outgoing federation workers. It looks like the list of dead and alive instances is never reloaded after starting. So if an instance is marked dead when Lemmy starts, it will never attempt to send any activities until Lemmy is restarted. Similarly, if an instance is alive at the start and marked as dead later, Lemmy will still attempt to send activities forever, according to the sleep interval. Am I missing anything, or do you have an idea how to fix this?